Imbalanced classification algorithm based on improved semi-supervised clustering
Yu LU, Lingyun ZHAO, Binwen BAI, Zhen JIANG
Journal of Computer Applications    2022, 42 (12): 3750-3755.   DOI: 10.11772/j.issn.1001-9081.2021101837

Imbalanced classification is a research hotspot in machine learning. Oversampling rebalances a dataset by increasing the number of minority-class samples through repeated sampling or artificial synthesis. However, most existing oversampling methods rely on the original data distribution and have difficulty revealing further distribution characteristics of the dataset. To address this problem, an improved semi-supervised clustering algorithm was first proposed to mine the data distribution characteristics. Second, based on the semi-supervised clustering results, high-confidence unlabeled samples (pseudo-labeled samples) were selected from minority-class clusters and added to the original training set. In this way, besides rebalancing the dataset, the distribution characteristics obtained by semi-supervised clustering could be used to assist imbalanced classification. Finally, the results of semi-supervised clustering and classification were fused to predict the final labels, which further improved imbalanced classification performance. With G-mean and Area Under Curve (AUC) as evaluation metrics, the proposed algorithm was compared with seven oversampling- or undersampling-based imbalanced classification algorithms, including TU (Trainable Undersampling) and CDSMOTE (Class Decomposition Synthetic Minority Oversampling TEchnique), on 10 public datasets. Experimental results show that, compared with TU and CDSMOTE, the proposed algorithm increases the average AUC by 6.7% and 3.9%, respectively, and the average G-mean by 7.6% and 2.1%, respectively. The proposed algorithm also achieves the highest average results on both metrics among all compared algorithms. These results indicate that the proposed algorithm can effectively improve imbalanced classification performance.
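For readers unfamiliar with the two evaluation metrics and the pseudo-label selection step named in the abstract, the following is a minimal Python sketch. It is not the authors' implementation. The metric functions use the standard binary definitions: G-mean as the geometric mean of sensitivity and specificity, and AUC as the probability that a random positive sample is scored above a random negative one. The function `select_pseudo_labeled`, its `threshold` parameter, and the per-sample cluster confidences are hypothetical illustrations; the abstract does not state how confidence is computed or what cutoff is used.

```python
from math import sqrt


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (TPR) and specificity (TNR), binary case."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return sqrt(tpr * tnr)


def auc(y_true, scores, positive=1):
    """AUC via pairwise comparison: a positive outscoring a negative counts 1, a tie 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_pseudo_labeled(cluster_ids, confidences, minority_clusters, threshold=0.9):
    """Hypothetical selection rule (an assumption, not from the paper):
    keep unlabeled samples that fall in a minority-class cluster and whose
    cluster confidence is at least `threshold`. Returns their indices."""
    return [
        i
        for i, (c, conf) in enumerate(zip(cluster_ids, confidences))
        if c in minority_clusters and conf >= threshold
    ]


if __name__ == "__main__":
    # Toy imbalanced problem: 2 minority (label 1) and 6 majority (label 0) samples.
    y_true = [1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
    scores = [0.9, 0.4, 0.1, 0.2, 0.3, 0.35, 0.05, 0.8]
    print("G-mean:", round(g_mean(y_true, y_pred), 4))
    print("AUC:", round(auc(y_true, scores), 4))

    # Toy unlabeled pool: cluster 1 is assumed to be the minority-class cluster.
    picked = select_pseudo_labeled(
        cluster_ids=[0, 1, 1, 0, 1],
        confidences=[0.95, 0.97, 0.6, 0.99, 0.92],
        minority_clusters={1},
    )
    print("Selected pseudo-labeled indices:", picked)
```

G-mean is used for imbalanced data because it drops to zero whenever either class is entirely misclassified, so a classifier that always predicts the majority class cannot score well on it. The pairwise AUC above is quadratic in the sample count and is meant only for illustration on small inputs.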
